Companies, such as Thomson Legal & Regulatory, Inc. of St. Paul, Minn. (doing business as Thomson West), collect and store a vast spectrum of documents, including news, from all over the world, for online access through a system of databases and research tools known as the Westlaw™ system. The Westlaw system enables users to search over 100 million documents.
One problem recognized by the present inventors is that searches conducted against news or other databases frequently return results that include duplicate documents, that is, documents that are completely or substantially identical to each other. The problem stems from news providers, such as the Associated Press (AP), selling their news stories for re-publication to multiple publishers around the world. As a result, systems, such as the Westlaw system, that provide users searchable access to collections of news stories from a wide array of publishers typically present users with many duplicate copies of the same story in their search results. Unfortunately, the duplicate stories are generally interleaved, according to relevance, with other distinct stories, leaving users to identify and/or filter them manually.
Accordingly, the present inventors recognized a need to effectively address how information-retrieval systems, such as the Westlaw system, handle duplicate documents in their document collections and, more importantly, within the search results presented to their users.